Implement GROUPING aggregate function (following Postgres behavior.) #12565

bgjackma · 2024-09-21T06:10:25Z

Which issue does this PR close?

Closes #5647.

Rationale for this change

Implements the GROUPING function as per Postgres.

https://www.postgresql.org/docs/15/functions-aggregate.html#FUNCTIONS-GROUPING-TABLE

This is in contrast to other implementations, including Databricks and Oracle, where GROUPING takes only one column and a separate GROUPING_ID function yields a similar bitfield.
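To make the bitfield concrete, here is a minimal sketch of the bit assignment described in the Postgres documentation linked above: the rightmost argument maps to the least-significant bit, and a bit is 1 when that argument is not part of the current grouping set. The function name and the `excluded` slice are hypothetical and are not taken from this PR.

```rust
/// Builds a Postgres-style GROUPING bitmask.
///
/// `excluded[i]` is true when the i-th GROUPING argument (left to right) is
/// NOT part of the grouping set that produced the current row.
/// Hypothetical helper for illustration; not code from this PR.
fn grouping_bitmask(excluded: &[bool]) -> u32 {
    assert!(excluded.len() <= 32, "at most 32 arguments fit in a u32");
    // Shift left and append, so the last (rightmost) argument ends up in bit 0.
    excluded
        .iter()
        .fold(0u32, |acc, &is_excluded| (acc << 1) | u32::from(is_excluded))
}
```

For `GROUPING(a, b)` under `ROLLUP(a, b)`, the rows grouped by `(a, b)`, `(a)`, and `()` would yield 0, 1, and 3 respectively.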

What changes are included in this PR?

Implement the aggregate function in the Physical Planning stage.

Are these changes tested?

A few unit tests, plus an integration test provided by @JasonLi-cn in a previous unfinished PR. I may add more.

Are there any user-facing changes?

Slight change to definition of unimplemented function.
Implementation of that function.

eejbyfeldt · 2024-09-21T08:46:39Z

datafusion/functions-aggregate/src/grouping.rs

+        // The PhysicalExprs of grouping_exprs must be Column PhysicalExpr. Because if
+        // the group by PhysicalExpr in SQL is non-Column PhysicalExpr, then there is
+        // a ProjectionExec before AggregateExec to convert the non-column PhysicalExpr
+        // to Column PhysicalExpr.
+        let column_index =
+            |expr: &Arc<dyn PhysicalExpr>| match expr.as_any().downcast_ref::<Column>() {
+                Some(column) => Ok(column.index()),
+                None => internal_err!("Grouping doesn't support expr: {}", expr),
+            };


This is only true when the optimizer rule CommonSubexprEliminate is enabled. It does not seem acceptable to depend on optimizer rules for correctness or basic support.

Can we look for equal PhysicalExprs?

The Postgres docs imply they do ~text comparison but I'm not sure how accessible that info is at this layer.
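As a rough sketch of the "look for equal expressions" idea, the lookup below matches a GROUPING argument against the GROUP BY list by structural equality rather than by downcasting to a column. The `Expr` enum and `group_expr_position` are hypothetical stand-ins, not DataFusion types; whether physical expressions can be compared reliably at this layer is exactly the open question in this thread.

```rust
/// Toy stand-in for a physical expression; DataFusion's real trait objects
/// are not used here. All names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(usize),
    Add(Box<Expr>, Box<Expr>),
}

/// Returns the position of `arg` among the GROUP BY expressions by structural
/// equality, instead of requiring `arg` to be a plain column.
fn group_expr_position(arg: &Expr, group_exprs: &[Expr]) -> Result<usize, String> {
    group_exprs
        .iter()
        .position(|g| g == arg)
        .ok_or_else(|| format!("GROUPING argument {:?} is not a GROUP BY expression", arg))
}
```

With this shape, a non-column argument such as `a + b` resolves as long as the same expression appears in the GROUP BY list, with no projection needed beforehand.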

eejbyfeldt · 2024-09-21T09:01:05Z

datafusion/physical-plan/src/aggregates/row_hash.rs

+    // GROUPING is a special fxn that exposes info about group organization
+    if let Some(grouping) = agg_expr.fun().inner().as_any().downcast_ref::<Grouping>() {
+        let args = agg_expr.all_expressions().args;
+        return grouping.create_grouping_accumulator(&args, &group_by.expr);
+    }


If we need special handling like this, it seems to me that we should consider just making Grouping a built-in.

Or we should probably make it more generic so it can be used to implement some other function. But since the input is just the bitmask and the output is the same, I wonder if there are any conceivable functions that could not just be implemented as a transformation on a built-in grouping function.

It's kind of in a weird place, it's sort of not a real aggregation function but instead a way to leak metadata. That might be a reason to make it a built-in.

Do you have ideas about how and when to go about doing that?

There's another function called GROUP_ID (not to be confused with GROUPING_ID) which disambiguates duplicate rows, it might be relevant.

> It's kind of in a weird place, it's sort of not a real aggregation function but instead a way to leak metadata. That might be a reason to make it a built-in.
>
> Do you have ideas about how and when to go about doing that?

One way might be to expose the grouping_id column used in #12571 and implement the function as a transformation on that. This should be possible because that column should "leak" the needed metadata. This is what was proposed in #5749.
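A minimal sketch of what "a transformation on the grouping id" could look like, using plain integers. The bit layout is an assumption made for this example (bit `j`, counted from the least-significant bit, is 1 when GROUP BY expression `j` is absent from the current grouping set); the actual layout of the column in #12571 is not stated in this thread, and the function name is hypothetical.

```rust
/// Derives a GROUPING(...) value from an internal grouping id.
///
/// ASSUMPTION (not confirmed by this thread): bit `j` of `grouping_id`,
/// counted from the least-significant bit, is 1 when GROUP BY expression `j`
/// is absent from the current grouping set.
/// `arg_positions` holds, for each GROUPING argument left to right, the
/// position of that argument in the GROUP BY list.
fn grouping_from_id(grouping_id: u32, arg_positions: &[usize]) -> Option<u32> {
    if arg_positions.len() > 32 {
        return None; // the result would not fit in a u32
    }
    let mut result = 0u32;
    for &pos in arg_positions {
        if pos >= 32 {
            return None; // no such bit in a u32 grouping id
        }
        // Pick out the argument's bit and append it as the new lowest bit.
        let bit = (grouping_id >> pos) & 1;
        result = (result << 1) | bit;
    }
    Some(result)
}
```

Under that assumed layout, with `GROUP BY` over `(a, b, c)` and a grouping id of `0b101` (`a` and `c` absent), `GROUPING(c, a)` maps to positions `[2, 0]` and evaluates to 3, while `GROUPING(b)` evaluates to 0.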

I pushed an initial implementation of this here: #12704

I think someone with more experience with this project should decide what the best way forward is.

alamb · 2024-09-24T14:51:04Z

Looks like there is a minor clippy failure on this PR

comphead

Thanks @bgjackma for your contribution. Great PR and the testing is awesome.
I would probably add user documentation, examples and benchmarks, but that can also be done in follow-up PRs.

comphead · 2024-09-30T20:44:37Z

datafusion/functions-aggregate/src/grouping.rs

+        grouping_args: &[Arc<dyn PhysicalExpr>],
+        group_exprs: &[(Arc<dyn PhysicalExpr>, String)],
+    ) -> Result<Box<dyn GroupsAccumulator>> {
+        if grouping_args.len() > 32 {


Let's make this a const.
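One possible shape for that, as a sketch: the constant name, the helper function, and the error text are all hypothetical, and the limit of 32 simply mirrors the literal in the diff above (one bit per argument in a `u32`).

```rust
/// Hypothetical constant name: a `u32` grouping id has room for 32 bits,
/// one per GROUPING argument.
const MAX_GROUPING_ARGS: usize = u32::BITS as usize;

/// Rejects argument lists that would not fit in the `u32` grouping id.
fn check_grouping_arg_count(num_args: usize) -> Result<(), String> {
    if num_args > MAX_GROUPING_ARGS {
        return Err(format!(
            "GROUPING supports at most {} arguments, got {}",
            MAX_GROUPING_ARGS, num_args
        ));
    }
    Ok(())
}
```

Deriving the value from `u32::BITS` keeps the limit tied to the id type, so the two cannot drift apart if the type changes.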

comphead · 2024-09-30T20:45:13Z

datafusion/functions-aggregate/src/grouping.rs

+            };
+        let group_by_columns: Result<Vec<_>> =
+            group_exprs.iter().map(|(e, _)| column_index(e)).collect();
+        let group_by_columns = group_by_columns?;


Can this be a one-liner?
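A sketch of the one-statement form, using only the standard library. Here `column_index` is a simplified stand-in for the closure in the diff (it takes an integer rather than a physical expression), and the error type is a plain `String`; with DataFusion's single-parameter `Result` alias the turbofish would presumably read `collect::<Result<Vec<_>>>()?`.

```rust
/// Stand-in for the PR's `column_index` closure: succeeds for non-negative
/// inputs and fails otherwise. Hypothetical; only the collect pattern matters.
fn column_index(expr: i64) -> Result<usize, String> {
    if expr >= 0 {
        Ok(expr as usize)
    } else {
        Err(format!("Grouping doesn't support expr: {}", expr))
    }
}

/// Collects straight into `Result<Vec<_>, _>` and applies `?` in one
/// statement; the first error short-circuits the iteration.
fn group_by_columns(group_exprs: &[(i64, String)]) -> Result<Vec<usize>, String> {
    let columns = group_exprs
        .iter()
        .map(|(e, _)| column_index(*e))
        .collect::<Result<Vec<_>, _>>()?;
    Ok(columns)
}
```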

comphead · 2024-09-30T20:46:41Z

datafusion/functions-aggregate/src/grouping.rs

+struct GroupingAccumulator {
+    // Grouping ID value for each group
+    grouping_ids: Vec<u32>,
+    // Indices of GROUPING arguments as they appear in the GROUPING SET


Can we have more detail, or an example, on the indices?

comphead · 2024-09-30T20:47:49Z

datafusion/functions-aggregate/src/grouping.rs

+}
+
+impl GroupingAccumulator {
+    fn mask_to_id(&self, mask: &[bool]) -> Result<u32> {


Please add more description of this method and how it changes the mask.
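For illustration, one plausible reading of the indices and of `mask_to_id`, reconstructed only from the field comment and the unit test shown in this thread. The struct here omits the `grouping_ids` field, uses `String` errors, and assumes that `mask[j]` is true when GROUP BY expression `j` is absent from the current grouping set; none of this is confirmed to match the PR's actual body. In this sketch the mask is only read, not modified.

```rust
/// Sketch of the accumulator's state; field name follows the PR's diff.
struct GroupingAccumulator {
    /// For each GROUPING argument (left to right), its position among the
    /// GROUP BY expressions. E.g. GROUP BY (a, b, c) with GROUPING(c, a)
    /// would give [2, 0]. This reading is an assumption.
    expr_indices: Vec<usize>,
}

impl GroupingAccumulator {
    /// `mask[j]` is assumed to be true when GROUP BY expression `j` is
    /// nulled out (absent) in the current grouping set.
    fn mask_to_id(&self, mask: &[bool]) -> Result<u32, String> {
        let mut id = 0u32;
        for &idx in &self.expr_indices {
            // An index past the end of the mask is reported as an error.
            let absent = *mask.get(idx).ok_or_else(|| {
                format!("index {} out of range for mask of length {}", idx, mask.len())
            })?;
            // Append the flag so the last argument lands in the lowest bit.
            id = (id << 1) | u32::from(absent);
        }
        Ok(id)
    }
}
```

With `expr_indices = [2, 0]`, the mask `[true, false, false]` (only `a` absent) gives 1, and `[false, false, true]` (only `c` absent) gives 2.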

comphead · 2024-09-30T20:48:46Z

datafusion/functions-aggregate/src/grouping.rs

+        _opt_filter: Option<&BooleanArray>,
+        total_num_groups: usize,
+    ) -> Result<()> {
+        assert_eq!(values.len(), 1, "single argument to merge_batch");


So we always expect exactly one array?

comphead · 2024-09-30T20:49:18Z

datafusion/functions-aggregate/src/grouping.rs

+            expr_indices: vec![5],
+        };
+        let res = grouping.mask_to_id(&[false]);
+        assert!(res.is_err())


You may want to check the error message as well.
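A sketch of asserting on the message rather than only on `is_err()`. The `lookup` function and its error text are invented for this example and stand in for `mask_to_id`; with a DataFusion error one would presumably compare against `err.to_string()` instead of a raw `String`.

```rust
/// Hypothetical fallible lookup standing in for `mask_to_id`; the error
/// text is invented for this example.
fn lookup(mask: &[bool], idx: usize) -> Result<bool, String> {
    mask.get(idx)
        .copied()
        .ok_or_else(|| format!("index {} out of range for mask of length {}", idx, mask.len()))
}

/// Returns true when `res` is an error whose message contains `needle`.
fn fails_with<T>(res: &Result<T, String>, needle: &str) -> bool {
    match res {
        Err(msg) => msg.contains(needle),
        Ok(_) => false,
    }
}
```

In a unit test this becomes `assert!(fails_with(&lookup(&[false], 5), "out of range"))`, which fails if the call starts returning a different error.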

comphead · 2024-09-30T20:50:01Z

datafusion/physical-plan/src/aggregates/mod.rs

@@ -1169,7 +1176,7 @@ pub(crate) fn evaluate_group_by(
        .groups
        .iter()
        .map(|group| {
-            group
+            let v = group


Let's use a more meaningful name?

comphead · 2024-09-30T20:50:44Z

datafusion/physical-plan/src/aggregates/row_hash.rs

 ) -> Result<Box<dyn GroupsAccumulator>> {
+    // GROUPING is a special fxn that exposes info about group organization


Suggested change

-    // GROUPING is a special fxn that exposes info about group organization
+    // GROUPING is a special function that exposes info about group organization

?

comphead · 2024-09-30T20:51:07Z

datafusion/physical-plan/src/aggregates/row_hash.rs

@@ -870,6 +883,7 @@ impl GroupedHashAggregateStream {
                | AggregateMode::SinglePartitioned => output.push(acc.evaluate(emit_to)?),
            }
        }
+        debug!("Output: {:?}", output);


Suggested change

-        debug!("Output: {:?}", output);

comphead · 2024-09-30T20:51:41Z

datafusion/physical-plan/src/aggregates/row_hash.rs

-        })?;
+        let mut output = group_values
+            .first()
+            .map(|gs| gs.values.clone())


Let's use better naming?

bgjackma · 2024-10-03T01:00:58Z

Thanks for the review, @comphead, but I think #12704 is a better solution, so I'm going to close this in favor of that.

Implement GROUPING aggregate function (following Postgres behavior.)

9a2d8cf

github-actions bot added the physical-expr (Physical Expressions), sqllogictest (SQL Logic Tests (.slt)), and functions labels on Sep 21, 2024

bgjackma marked this pull request as ready for review September 21, 2024 06:11

eejbyfeldt reviewed Sep 21, 2024

Satisfy Clippy

9c0ed8a

bgjackma force-pushed the bgjackma/main branch from 382bb23 to 9c0ed8a Compare September 25, 2024 19:01

comphead reviewed Sep 30, 2024

bgjackma closed this Oct 3, 2024

eejbyfeldt mentioned this pull request Oct 13, 2024

feat: Implement grouping function using grouping id #12704

Merged
